• Negative predictive value (NPV; probability of actually being negative) = TN / (TN + FN)
• Accuracy (correct classification rate) = (TP + TN) / (TP + FP + TN + FN)
• Misclassification rate = (FP + FN) / (TP + FP + TN + FN)
• Prevalence (proportion of actually positive persons in the total number) = (TP + FN) / (TP + FP + TN + FN)
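The metrics above can be computed directly from the four confusion-matrix counts. A minimal sketch (the TP/FP/TN/FN counts are made-up example values, not data from the text):

```python
# Hypothetical confusion-matrix counts for illustration only
TP, FP, TN, FN = 40, 10, 45, 5
total = TP + FP + TN + FN

npv = TN / (TN + FN)                    # negative predictive value
accuracy = (TP + TN) / total            # correct classification rate
misclassification = (FP + FN) / total   # equals 1 - accuracy
prevalence = (TP + FN) / total          # proportion of actually positive cases

print(npv, accuracy, misclassification, prevalence)
# 0.9 0.85 0.15 0.45
```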
For the graphical representation, a ROC curve (Receiver Operating Characteristic; x-axis: false positive rate, y-axis: sensitivity) is often used, where the AUC (Area Under the Curve) is a measure of the quality of the classification (higher AUC value = better classification). An ideal classification model has a 100% true positive rate (100% sensitivity) and a 0% false positive rate (100% specificity), but this is rarely achieved in reality. For example, in a recent paper we were able to show that a novel real-time PCR has better predictive power for the detection of Trypanosoma cruzi in Chagas disease and is superior to previous PCR methods, but is still not 100% accurate (Kann et al. 2020). In any case, it is advisable to always build a prediction model on the basis of a training and a test data set and to validate it on at least one independent data set, in order to reliably assess its predictive power for a possible application, such as a clinical decision support system.
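The AUC has a useful probabilistic reading: it is the probability that a randomly chosen positive case receives a higher classifier score than a randomly chosen negative one. A minimal sketch of this computation (the labels and scores below are toy values for illustration):

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative (Mann-Whitney interpretation); ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A perfect classifier ranks all positives above all negatives
print(auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.2]))  # 1.0
# Misranked cases lower the AUC
print(auc([1, 1, 0, 0], [0.9, 0.2, 0.3, 0.4]))  # 0.5
```

For real data, library routines such as scikit-learn's `roc_curve` and `roc_auc_score` compute the same quantities efficiently.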
Artificial Neural Networks Another possibility for machine learning is the use of simple neural networks, which consist of an input layer, a simple intermediate layer, and an output layer. Connections between these three layers are strengthened or weakened so that the output is as accurate as possible. To do this, the neural network is trained on a training dataset (automatically: unsupervised; with human review: supervised) and its accuracy is then checked on a separate test dataset. This can then be used, for example, to generate an optimal prediction of helix and beta boundary regions in protein structures (PredictProtein software, https://predictprotein.org) and to determine protein localization. The deep learning approach extends the simple neural network by several layers of intermediate neurons, which in particular get by with fewer neurons in the later layers (and thus bring results together, "converge"). This replicates, in very simplified terms, an abstraction of the many inputs into more general concepts. These networks are more complex to train ("back-propagation" and other steps) but, often further improved with other strategies from artificial intelligence research, also achieve amazing things, such as optical image recognition of leukemia cells through improved swarm optimization (Sahlol et al. 2020) or the automatic recognition of secondary structure and oligonucleotides in electron micrographs (Mostosi et al. 2020); eventually even antibiotics can be discovered with this deep learning approach (Stokes et al. 2020), as can the energy potentials and thus the three-dimensional structure of proteins (Senior et al. 2020), now culminating in large-scale and accurate deep-learning-based prediction of human proteins (Tunyasuvunakool et al. 2021).
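The basic training loop described above (forward pass, then back-propagation of the error) can be sketched on a toy problem. The network below, with two hidden neurons learning the logical OR function, is an illustrative assumption; layer sizes, learning rate, and epoch count are made up for the sketch and have nothing to do with the protein-prediction systems cited:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(0)
# 2 inputs -> 2 hidden neurons -> 1 output (toy architecture)
w1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
b1 = [0.0, 0.0]
w2 = [random.uniform(-1, 1) for _ in range(2)]
b2 = 0.0

# Training data: truth table of logical OR
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
lr = 1.0  # learning rate (illustrative)

for _ in range(2000):
    for x, t in data:
        # forward pass
        h = [sigmoid(sum(w1[j][i] * x[i] for i in range(2)) + b1[j])
             for j in range(2)]
        y = sigmoid(sum(w2[j] * h[j] for j in range(2)) + b2)
        # backward pass: squared-error loss, gradients via the chain rule
        dy = (y - t) * y * (1 - y)
        for j in range(2):
            dh = dy * w2[j] * h[j] * (1 - h[j])  # before updating w2[j]
            w2[j] -= lr * dy * h[j]
            for i in range(2):
                w1[j][i] -= lr * dh * x[i]
            b1[j] -= lr * dh
        b2 -= lr * dy

# After training, predictions should round to the OR truth table
for x, t in data:
    h = [sigmoid(sum(w1[j][i] * x[i] for i in range(2)) + b1[j])
         for j in range(2)]
    y = sigmoid(sum(w2[j] * h[j] for j in range(2)) + b2)
    print(x, round(y))
```

Deep learning stacks many such layers and relies on the same gradient mechanics, plus the additional strategies mentioned above, to scale far beyond this toy setting.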
14 We Can Think About Ourselves – The Computer Cannot